| I confirm that I have fully read and understood the assignment brief for this module. | Y |
| Name: Chung Yan | Surname: Yu |
| Submission Date | 19-10-25 |
| Word Count: whole assignment including codes | 8,972 |
| Word Count: main body excluding abstract, references and supplementary materials | 3,109 |
This assignment is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. It is not substantially the same as any that I have previously submitted for a degree or diploma or other qualification at the University of Cambridge or any other university or similar institution, or that is being concurrently submitted, except as declared in the Preface and specific in the text. I further state that no substantial part of my Portfolio has already been submitted, nor is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other university or similar institution except as declared in the Preface and specified in the text.
| I confirm the statement of originality as above | Y |
Self-assessment is an important aspect of feedback literacy, which is, in turn, key to the development of expertise. As you proceed through the MSt Healthcare data science programme, we hope that you will make use of the following prompts to assess your own work on assignments. Specific assignment briefs will likely indicate which of these to address for which assessments, but, in general, we expect you to respond to one or two for each assignment on your course.
For each of the questions, do not spend too long answering – keep it brief. For each question you answer, limit yourself to no more than three items. And please remember, this is optional and developmental: these cover sheets are designed to create space for self-assessment and feedback dialogue, rather than additional assignment workload.
| Which permitted use of generative AI are you acknowledging? | Semantic search of literature and notes, outline creation, code debugging, output formatting and generation, informed feedback |
| Which generative AI tool did you use (name and version)? | Claude Code v2.0.3x (the version likely changed over usage), Claude Desktop v1.0.3x with PubMed Connector, M365 CoPilot |
| What did you use the tool for? | Searching for relevant journal publications, sanity check on coding strategies, collating notes in my Obsidian vault, debugging code by parsing error messages |
| How have you used or changed the generative AI’s output | E.g. my collated notes always stay in point form so that I write out the paragraphs in my own words. Feedback provided by the GenAI models are weighed and assessed before being acted on. Code generated are checked, e.g. it tried to run Levene test on a model fitted using residuals as the response, this was rectified back to the relevant response variable. |
The GitHub repository of this project is found at https://github.com/joelyu/HDS-W1.
Diffuse large B cell lymphoma (DLBCL) is the most common non-Hodgkin lymphoma, with prognosis influenced by demographic features (age, gender), integrated scoring systems such as the International Prognostic Index (IPI) and genetic alterations. Oncogenes MYC, BCL2, BCL6 emerged with strong DLBCL associations, and further studies have identified over 150 genetic drivers. Yet whether these genetic markers’ effects differ across demographic subgroups remains understudied. This study analysed the gene expression data of 21 genes from 775 DLBCL patients to investigate whether age and gender modify gene expression-response associations. Hierarchical logistic regression models tested demographic-gene interactions across 21 genes and 2 demographic factors (age and gender, totalling 42). Only CCND2 showed significant interaction effects (Likelihood ratio test adjusted p = 0.006) with age on predicting complete response. Older age groups showed significant increase (61-75 odds ratio p = 0.014; >75 odds ratio p = 0.001) in complete response probability with higher CCND2 expression, while younger patients (≤40, 41-60) showed the opposite insignificant trend. Gender revealed no main or interaction effects across all genes. MYC showed significant prognostic effects (likelihood ratio test adjusted p = 0.010) with response completeness that were consistent across all demographic groups, with no interaction effects. With only 1 of 42 tested demographic-gene combinations demonstrating interaction effects, the demographic modifying effects remain a rare phenomenon amongst genetic markers examined. This suggests demographic stratification may generally be unnecessary for genetic markers. The age-dependent effects of CCND2 could however provide a starting point to reconcile conflicting prognostic associations of CCND2 expression levels in existing literature, although further validation in other independent cohorts and mechanistic confirmation are required.
Diffuse large B cell lymphoma (DLBCL) is the most common type of aggressive non-Hodgkin’s lymphoma (NHL)1,2, with a wide range of factors contributing to prognosis. For a comparable and comprehensive pre-treatment prognosis model, the International Prognostic Index (IPI) emerged as a standardised model for DLBCL and other aggressive NHLs3, combining 5 factors: age, Ann Arbor stage classification, ECOG Performance Status, serum lactate dehydrogenase (LDH) level and the number of extranodal sites of disease. By pooling factors together, the IPI outperformed simple disease stage classification models such as Ann Arbor. Then came the National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI)4 as an update to the IPI, with the key innovation of using cubic splines to model the continuous factors of age and LDH levels as higher resolution categorical factors, modifying them from binary factors to quaternary and ternary factors respectively. As prognosis models improved, so did treatment. The first generation chemotherapy of cyclophosphamide, doxorubicin, vincristine, and prednisone (CHOP) had a cure rate of 30-35%5. With the addition of rituximab to the regimen (R-CHOP) and other advances, DLBCL becomes curable for more than 60% of patients6.
While age is incorporated into the IPI (and subsequent NCCN-IPI), other demographic factors such as gender are not, despite established evidence for their prognostic involvement. Males demonstrate higher hazard ratios7 and gender-associated pharmacokinetic differences in rituximab lead to worse treatment responses in males8. These demographic effects raise questions about whether risk factors perform uniformly across patient subgroups. Beyond demographic and clinical factors, a series of genetic alterations have emerged as strong prognostic markers: BCL29, BCL610 and MYC11. These changes are observed at both the genotypic and phenotypic levels, as translocation mutations and protein overexpression respectively12. Furthermore, patients with varying combinations of these alterations have shown worse responses to R-CHOP treatments, highlighting the clinical relevance of genomic profiling. Some genetic markers have shown age-associated expression patterns, and certain genetic drivers (BCL6, IRF4) lose prognostic significance when age is incorporated in multivariate models13. This indicates that demographic factors may not merely add to genomic risk additively, but could modify how these genomic drivers influence clinical outcomes.
With the advent of machine learning methods came attempts to construct models based on clinical and genomic data. Reddy et al.14 performed whole exome and transcriptome sequencing in 1001 DLBCL patients, identifying 150 genetic drivers including BCL2, BCL6, and MYC alongside numerous newly characterised candidates. By combining genotypic alterations, gene expression data and cell-of-origin classification, they developed a genomic risk model that outperformed the IPI. However, the potential for demographic factors to modify these genomic associations remains understudied. Given that age and gender both demonstrate prognostic significance and influence treatment response, it remains unknown whether genomic risk markers perform uniformly across demographic subgroups. If these demographic factors modify how gene expression levels impact treatment response, such that the same gene expression levels in different gender or age groups lead to different prognostic outcomes, it would be crucial that these differences are identified and applied in future models to prevent inaccurate prognoses.
This study addresses this gap by investigating whether age and gender interact and modify the prognostic associations between gene expression markers and therapy response in DLBCL. The associations between the demographic factors (age & gender) with complete therapy response and the gene expression levels of 21 genes were first considered. Logistic regression models using gene expression levels of 21 genes were fitted with therapy response, and the interaction of demographic factors with these gene expression levels were examined. Exploring how demographic factors interact with known and potential genomic drivers of DLBCL could reveal age or gender-dependent tumour biology that inform therapeutic approaches.
The analysis utilises clinical and gene expression data15 from Reddy et al.’s genomic risk model study14 of 1001 DLBCL patients, focusing on 775 patients with complete gene expression profiles for 21 genes including the big three oncogenic drivers (MYC, BCL2, BCL6) and cell-of-origin classifiers. Detailed data breakdown and processing are documented in the Methods section.
The datasets were publicly available on the ScienceDirect webpage
of Reddy et. al’s study14, and the corresponding
author was contacted via email to inform them of this secondary
analysis. A public GitHub
repository was created to host the processed datasets, attribution,
license information and supplementary materials to ensure data openness,
transparency and reproducibility. For the original study, the authors
obtained anonymised patient clinical information and tumours, which were
processed in line with a protocol approved by the Institutional Review
Board at Duke University. Thus no further anonymisation is required. A
total number of 2 datasets were downloaded as Excel files directly from
their links on the aforementioned ScienceDirect site (Table
S1 & Table
S2) using the download.file function from R’s
utils package16 and read via the
readxl package17. The specific sheets were
selected then cleaned to be exported as .csv files (see 01-data-preprocessing.R18,19). The processed .csv
files were uploaded to this project’s GitHub repository: mmc1-ClinicalInformation.csv
(raw), mmc2-GeneExpression.csv
(raw). This RMarkdown document is self-sufficient and contains code
for the relevant analysis and plots shown only, for a complete analysis
including assumption testing, post-hoc analysis, tests that showed
insignificant results, please refer to the sequenced scripts in the supplementary-materials
folder of this repository. This document also supports sourcing the
datasets from 3 sources for redundancy: original ScienceDirect links,
processed .csv files on GitHub and local processed
.csv files when this repository is downloaded. For this
knitted20 HTML document, both datasets were
loaded directly from ScienceDirect’s supplementary materials: S1
Table (clinical data) and S2
Table (gene expression data).
The 2 datasets were joined using common patient IDs and 2 new columns
were added. age_group_nccn was obtained by transforming the
age at diagnosis variable into 4 categorical groups according to cutoffs
defined by NCCN-IPI4: ≤40, 41 - 60, 61 - 75,
>75. complete_response was added by combining “Partial
response” and “No response” as “Incomplete response” from
response_therapy while keeping “Complete response”
unchanged. This prepared the dataset for hierarchical binomial logistic
regression. As for duplicate gene expression columns for BCL6, the two
columns were compared to verify identicality and 1 was discarded. The
resulting table was one with 775 patients and 27 variables, of which
include the anonymised patient ID; demographic variables such as gender
(binary: F/M), age (continuous, at diagnosis) and age group (quaternary,
as defined by NCCN-IPI); clinical information of therapy response
(ternary: Complete response, Partial response, No response); and the
log2 transformed gene expression levels of 21 genes: MYC, BCL2, BCL6,
ITPKB, MME, MYBL1, DENND3, NEK6, LMO2, LRMP, SH3BP5, IRF4, PIM1, ENTPD1,
BLNK, CCND2, ETV6, FUT8, BMF, IL16 and PTPN1.
Gene expression data were limited to these 21 genes available from the original study’s datasets (S1 & S2), representing 2 categories. The first category is the “big three” oncogenic drivers of MYC, BCL2 and BCL6, serving as strong prognostic markers. Moreover, MYC21 and BCL613 have both shown age-dependent prognostic effects, suggesting age may modify how these markers influence clinical outcomes. The second category comprises 19 cell-of-origin classifier genes (DLBCL subtypes: ABC vs GCB)22, with BCL6 appearing in both categories. Despite their initial utility for classification, genes such as CCND2 and LMO223 have shown to be strong survival predictors alongside BCL2 and BCL6. More importantly, IRF4 is among the genetic features showing age associations and contributes to genetic complexity that loses prognostic significance when age is incorporated13. Both categories therefore contain genes with potential to interact with demographic factors, making them suitable for testing whether demographic factors modify their prognostic associations.
Overall, data completeness for these 27 columns was exceptional, with
23 columns being 100%, and the remaining 4 columns all with more than
93% completeness. Processing is documented in 02-data-processing.R.
Characteristics of the cohort are summarised24 in
Table 1.
Table 1. Baseline Cohort Characteristics
| Characteristic | Total (N=775) | Complete Response | Incomplete Response |
|---|---|---|---|
| Sample size | N = 775 | n = 598 | n = 126 |
| Gender | |||
| Female | 338 (43.6%) | 258 (43.1%) | 54 (42.9%) |
| Male | 437 (56.4%) | 340 (56.9%) | 72 (57.1%) |
| Age at diagnosis, years | 61 ± 15.4 | 59.8 ± 15.7 | 63.1 ± 14.3 |
| Age groups (NCCN-IPI) | |||
| ≤40 years | 75 (10.1%) | 67 (11.5%) | 8 (6.5%) |
| 41-60 years | 249 (33.5%) | 202 (34.8%) | 41 (33.1%) |
| 61-75 years | 285 (38.3%) | 218 (37.5%) | 47 (37.9%) |
| >75 years | 135 (18.1%) | 94 (16.2%) | 28 (22.6%) |
Demographic and clinical characteristics stratified by treatment
response.
Values shown as n (%) for categorical variables and mean
± SD for continuous variables.
Treatment response categories
combine partial and no response as “Incomplete response” versus complete
response.
NCCN-IPI age groups represent National Comprehensive
Cancer Network International Prognostic Index classifications.
Percentages are calculated within each response group for gender and age
group distributions.
Baseline gene expression patterns across demographic groups were
examined using t-test (gender) and one-way Welch’s ANOVA25 (age
groups) with false discovery rate (FDR)26 correction
due to the number of genes analysed (see 04-data-analysis.R).
Post-hoc Games-Howell tests were conducted for significant ANOVA
results27. Assumptions for these tests
were checked with diagnostic tests and plots, and the overall
distribution of gene expression against age and gender visually examined
(see 03-data-exploration.R).
Second, associations between therapy response completeness and
demographic factors were examined with the Chi-squared tests, with
expected frequencies checked (> 5) (see 05-response-analysis.R).
The baseline information from Exploratory Statistics only describes demographic-associated patterns, but not whether demographic factors modify the prognostic associations of gene expression. To systematically isolate and analyse these prognostic associations of gene expression, demographic factors and the interaction between the two, hierarchical logistic regression models28 were fitted for all 21 genes and 2 demographic factors (42 combinations). Each combination had four logistic regression models fitted to identify significant terms that contribute to therapy response completeness, the models were nested as:
All fitted models were analysed with likelihood-ratio test (LRT) to
obtain FDR-corrected p-values. Significant models were then checked for
model fit with goodness-of-fit Chi-square tests and Akaike Information
Criterion (AIC) scores, along with influential observations and
collinearity29 (see 06-effect-analysis.R).
Gene expression levels showed no significant differences across gender groups after t-test with FDR correction. Ten genes showed significantly different expression levels after Welch’s ANOVA with FDR correction across NCCN-IPI age groups: BCL2 (p = 0.0002), BLNK (p = 0.003), MYBL1 (p = 0.010), SH3BP5 (p = 0.016), ITPKB (p = 0.020), CCND2 (p = 0.020), PTPN1 (p = 0.031), BMF (p = 0.034), FUT8 (p = 0.037), LMO2 (p = 0.038). Post-hoc analysis with Games-Howell revealed that age groups mostly clustered into two statistically distinct groups for each gene, save for BLNK with three groups Figure 130–32.
Figure 1. Gene Expression Patterns Across Age Groups.
Fig. 1
Box plots show gene expression
distribution by age group for genes with significant Welch’s ANOVA
results (FDR-adjusted p < 0.05).
Compact letter display (CLD)
letters (a, b, c) above each boxplot indicate statistical groupings from
Games-Howell post-hoc tests within each gene; groups with different
letters differ significantly (p < 0.05).
Sample sizes (n) are
shown below each age group.
Outliers are highlighted in darker
borders.
Neither gender (p = 1.000) nor age group (p = 0.173) showed significant associations with complete response, although incomplete response proportions did trend upwards as age increased. While neither demographic factor showed prognostic effect on treatment response, the key question whether they modify the gene expression-therapy response association is addressed in the section below.
Systematic testing of 21 genes across both age and gender yielded 42 gene-demographic combinations, with each combination having 4 nested models. For a full breakdown of these nested models per gene, please refer to the Methods section above.
No gene-gender combinations showed significant effect, with neither gene-gender interaction nor gender main effect models attaining significant FDR adjusted p-values from LRT testing for any gene.
As part of fitting hierarchical models for each gene with either age or gender groups, gene-only models predicting therapy response (i.e. no demographic modification) were also fitted (Figure 2). One gene out of 21, MYC, displayed significant gene-only effects in both MYC-age (LRT FDR-adjusted p-value = 0.010, AIC = 647.57, Chi-square goodness-of-fit p-value = 0.947) and MYC-gender (LRT FDR-adjusted p-value = 0.010, AIC = 661.00, Chi-square goodness-of-fit p-value = 0.960) combinations. The AIC score difference between the age and gender models could be explained by the difference in data completeness for age (\(n_{\text{age}}=705\)) and gender (\(n_{\text{gender}}=724\)) groups. As gender showed neither main nor interaction effects, and the gene-only model AIC scores were both the lowest of each respective demographic combinations, this indicated a robust gene-only effect on prognostic associations that is consistent across age and gender groups (Figure 3a), where higher MYC expression predicts lower complete response probability.
Figure 2. Gene-only Effects on Complete Response
Fig. 2
Orange: Adjusted significant
Blue:
Adjusted insignificant
Lilac: Insignificant
Confidence interval:
95%
MYC is the only gene showing significant gene effect after FDR
correction, being well clear of the 1.0 reference (no effect) reference
line even when considering its CI error bars.
Odds ratio shown here
are calculated from gene-age combinations.
For age, only the CCND2 gene showed significant (LRT FDR-adjusted p-value = 0.006, AIC = 643.12, Chi-square goodness-of-fit p-value = 0.973) effects amongst the interaction models of all 21 genes, and the rest 20 genes showed neither main (additive) nor interactive effects with age. Having the lowest AIC score (min delta = 12.7) amongst the 4 CCND2 models indicate that response completeness is best predicted with gene expression when its interaction with age is factored in.
Clear age-dependent directional effects emerge for CCND2 (Figure 3b), with higher CCND2 expression levels affecting the response completeness of the age groups differently. The curves for age groups ≤40 and 41-60 and the curves for age groups 61-75 and >75 went in opposite directions. Complete response probability decreases with CCND2 expression level for the 2 younger groups, while complete response probability increases with CCND2 expression level for the 2 older groups. Further scrutinising the odds ratio33 of the 4 age groups (Figure 4) confirms that both older age groups showed significantly higher complete response odds (61-75 p = 0.014; >75 p = 0.001), and younger age groups (≤40, 41-60) have trends for worse but statistically insignificant odds.
Figure 3. Logistic Regression Models - Gene Expression Effects on Complete Response
Fig. 3
A. Logistic regression
plot of MYC gene expression against complete response probability,
age-interaction model
B. Logistic regression plot
of CCND2 gene expression against complete response probability,
age-interaction model
The MYC gene showed no significant
age-interaction effects, contrasting the curves of the different age
groups for CCND2.
Figure 4. CCND2 Interaction Effects - Age Group Modifying Gene Expression Effects on Complete Response
Fig. 4
Odds ratio of age groups from the
CCND2-age interaction model.
Orange: Significant
Lilac:
Insignificant
Confidence interval: 95%
Both older age groups
showed significant modifying effect.
This systematic analysis tested whether demographic factors modify gene expression associations with treatment response across 21 genes and 2 demographic factors (42 total combinations). Only CCND2 showed significant age interaction effects, while MYC showed significant gene-only effects. Gender revealed neither main nor interaction effects for any gene. The rarity (1 out of 42 combinations) suggests demographic stratification could be generally unneeded for most genetic markers, with CCND2 representing an exception pending validation in other cohorts.
CCND2’s age modifying effects significantly associate higher expression with better outcomes in older patients (61-75, >75) but younger patients have an opposite yet statistically insignificant trend. This represents a novel finding with no precedent in the literature. While it is established that CCND2 is involved in the cell cycle and is also a target for the MYC gene34, no prior studies have demonstrated age-dependent prognostic association, and the underlying biological mechanism remains unclear and requires investigation. This finding also provides a possible opening to reconcile conflicting CCND2 literature, where studies have reported both positive35 and negative36,37 prognostic associations. These differences could be observing genuinely different effects rather than contradicting results. A reanalysis of these datasets while factoring age effects could be a good first step to validate this premise.
MYC’s consistent effects across all demographic groups validates its established oncogene status38 and its role as a universal prognostic marker, contrasting CCND2’s age-dependence. The other two of the “big three” oncogenes BCL2 and BCL6 only showed marginal gene-only effects that lost significance after FDR correction.
The lack of gender effects aligns with the findings of NCCN-IPI’s authors, where they found no substantial enhancement to their model when including gender4, and further validates gender’s exclusion from the index.
While the adequate sample size \((n=775)\) allowed for systematic testing of 42 combinations, several limitations warrant consideration. As a study on demographic effects, ethnicity information was absent, despite probable international patient involvement. Gene expression data was limited to 21 genes pre-selected for publication, missing other potential genes and interactions. It would also be of great boon to the study to be able to benchmark the analysis here with Reddy’s genomic risk model. Unfortunately, the (hyper-)parameters required to rebuild such model along with the interactive webtool for this model are no longer accessible. This highlights the challenges for reproducibility in cancer research39 and bioinformatics40 in general, and makes me appreciate the emphasis of robust replicability our lecturers had imparted on us during the first module.
Future opportunities include validating CCND2’s age interaction effects in different cohorts, and reanalysing studies with conflicting CCND2 prognostic associations with age stratification. Only after validation in multiple cohorts should mechanistic studies investigate the biological basis of age-dependent CCND2 effects, alongside broader systematic screening for demographic interactions across additional genomic markers.
This analysis of 775 DLBCL patients tested for demographic modification of gene expression associations with treatment response for 21 genes and 2 demographic factors (42 total combinations). CCND2 was the sole gene demonstrating significant age-interaction effects, with opposite directional effects in older (statistically significant) and younger patients (insignificant trend). MYC showed consistent gene-only prognostic effects across all demographic categories. Gender showed neither main nor interaction effects for any of the genes (21) tested. Given the rarity of interaction effects among the genes and demographics tested, this supports a more universal approach for genomic markers without demographic stratification. The age-dependent CCND2 effects represent a novel finding requiring validation in independent cohorts before clinical application and mechanistic investigation. This study demonstrates that age and gender modification of genomic markers is generally rare in DLBCL, suggesting a simplified approach to prognostic model development.